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Abstract 


A family of gradient descent algorithms for learning linear functions in an online setting is 
considered. The family includes the classical LMS algorithm as well as new variants such as 
the Exponentiated Gradient (EG) algorithm due to Kivinen and Warmuth. The algorithms 
are based on prior distributions defined on the weight space. Techniques from differential 
geometry are used to develop the algorithms as gradient descent iterations with respect to 
the natural gradient in the Riemannian structure induced by the prior distribution. The 
proposed framework subsumes the notion of “link-functions”. 

Keywords: Gradient descent, exponentiated gradient algorithm, natural gradient, link- 
functions, Riemannian metric 


1. Introduction 


The LMS (least mean-square or Widrow-Hoff algorithm) (Clarkson, 1993) is very widely 
used in signal processing and various learning problems (Duda, Hart, and Stork, 2001). 
Recently some interesting variants of this algorithm including the Exponentiated Gradient 
(EG) algorithm have been developed by Kivinen and Warmuth (1997). The EG algorithm 
has been shown (both theoretically and experimentally) to have better performance in sit- 
uations where the target weight vector is sparse. The theoretical framework used in that 
analysis is also relatively new (the so called mistake-bounded framework). An alternative 
analysis (Hill and Williamson, 1999, 2001) (analogous to more traditional ways of view- 
ing LMS) essentially reproduces the conclusion that EG works well with a sparse target. 
More recently Grove et al. (1997), Warmuth and Jagota (1997), Kivinen and Warmuth 
(2001, 1998), Gentile and Littlestone (1999) and Gordon (1999a,b) have analyzed a range 
of general families of gradient descent algorithms inspired by the EG algorithm for both 
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classification and regression problems. There are several different viewpoints taken in these 
works, analyzing the algorithms in terms of Bregman divergences, matching loss functions 
and conjugate priors (for example). In general, such algorithms perform better on a partic- 
ular class of problems related to prior knowledge of the target weight vector. It is natural to 
study the relationship between prior information and numerical learning algorithms in order 
to design effective learning algorithms for new classes of problems. Kivinen and Warmuth 
(1998) introduced the concept of ‘link functions’ in order to generalize the derivation of 
the EG algorithm. An alternative is the natural gradient learning framework developed by 
Amari (1998). Such a learning algorithm is well motivated as a stochastic gradient descent 
algorithm derived with respect to the maximally non-informative (or Fisher) information 
metric; see Amari (1985), Barndorff-Nielsen (1988), Murray and Rice (1993), Douglas and 
Amari (2000). These algorithms correspond to assuming a maximally non-informative (or 
Jeffery) prior distribution on the target weight vector. 


In this paper we present a new method to derive stochastic gradient algorithms that is 
closely linked to Bayesian prior information. The approach taken is to link prior distribu- 
tions to a Riemannian structure on weight space that is called a preferential structure. In 
the case of product prior distributions the connection between a preferential structure and a 
Bayesian prior distribution may be made relatively precise. The framework proposed leads 
to a constructive method to design stochastic gradient descent algorithms that are adapted 
to perform well under certain prior assumptions. Once a stochastic gradient descent algo- 
rithm has been designed for a certain application the numerical cost of its implementation 
is of the same order as that of the LMS algorithm. A theorem is proved showing that 
(subject to some technical conditions) a stochastic gradient algorithm designed according 
to the framework proposed is, on average and when the prior assumptions hold true, lo- 
cally the most efficient stochastic gradient descent algorithm. The preferential structure 
proposed provides an interpretation of the EG algorithm in terms of a product prior dis- 
tribution that heavily weights zero against non-zero weight vector entries. The intuition 
associated with the prior weighting fits with the observed numerical advantages of the EG 
algorithm. Using the framework of prior distributions and preferential structures several 
new numerical learning algorithms are derived based on prior distributions of particular 
interest. Simulations are given comparing relative performance of the algorithms proposed. 
The proposed framework also provides an interpretation of the role played by link functions 
in the development of Kivinen and Warmuth (1998). The key contribution of the paper is 
in providing a new tool in the optimization of stochastic gradient descent algorithms for 
real world applications. 


Section 2 of the paper reviews the learning problem considered and relates the approach 
taken to the previous work of Amari (1997, 1998). In Section 3 general preferential struc- 
tures are motivated and defined. Section 4 concentrates on the special case where the prior 
distribution in parameter space is a product distribution. The preferential (natural) gra- 
dient descent algorithm is introduced in Section 5. In Section 6 existing algorithms (such 
as the EG algorithm) are interpreted in terms of prior distributions and preferential struc- 
ture. Section 7 shows the connection between “link functions” and the proposed framework. 
Some examples of different algorithms are presented in Section 8. An analysis of local per- 
formance of different learning algorithms is provided in Section 9. Results of simulation 
experiments for the examples considered in Section 8 are given in Section 10. 
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Finally, in order to aid readers unfamiliar with Riemannian geometry, in Appendix A 
we have presented a gentle introduction to the basic ideas used in this paper. 


2. Problem Formulation 


In this section the framework for the learning problem considered is presented. The con- 
ceptual difference between the proposed approach and that based on recent developments 
in statistical geometry (cf. Amari (1998)) is outlined. The gradient descent (or LMS) 
algorithm and the exponentiated gradient algorithms are presented. 

Consider the class of linear model relationships with inputs in R and outputs in R; 
the set of maps 


r= (w,r), £ ERY, 


where w € RY and (w,x) = ww. For a sequence of data {x1, £2, ...} assume that there are 
associated outputs 


Yk = (Wa, Lk) + Nk, (1) 


generated by an unknown “true” system x +> (wx, £) perturbed by noise ng. Depending on 
the problem setting and the type of analysis to be attempted, different assumptions can be 
made on the noise. We do not make any assumptions here, but merely mention that there 
usually is noise because it will ultimately govern the tradeoff between convergence speed 
of the algorithms considered and their steady-state error (how much the estimates “jiggle 
about” the true weight vector wx). 

The problem considered is to learn the unknown w, for the incoming data stream 


Sp := { (21, Y1), (22, Y2), <- - , (Les Yk) }- (2) 


In this paper we make no assumptions about the sequence (x) except in Section 9. (This is 
because apart from in that section we are not actually making a performance analysis.) This 
problem is known as a supervised learning problem since to obtain the training sequence 
Sk one needs both a set of trial data points {x,} and the measured outputs {yx} which 
would in practice be supplied by a “supervisor” (human or machine). In the presence of 
the noise 7, this problem becomes one of parametric statistical inference, that of finding 
the statistical model pwu, which best explains the observed data. 

A learning rule is a method of determining a sequence of estimates {wy ,we,..., Wk}, 
where wz depends on Sk, which “learns” the parameter wą, that is we > w,. Many 
practical learning algorithms proposed in the literature are based on stochastic gradient 
descent schemes; see for example Fine (1999), Hassoun (1995), Duda et al. (2001). Let 


Uk = (Wk, Lk). (3) 


denote the estimated output. The instantaneous loss function 


Eyes tu) = Fe — te), (4) 
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measures the mismatch between the training sample yz and the estimated output Yg. 
The Gradient Descent GD learning rule updates the present estimate wg in the direction 
of steepest descent of the cost L(y, Gx) 





OL 
w = Wk — Sk—— (Yk, Ik). 5 
k+1 = Wk ka, uk Yk) (5) 
Here 2 (yi, Dk) is the column vector of partial differentials oe, fori = 1,...,N. The 


scalar s € R is a step-size scaling factor (or learning rate) which is chosen to control 
how large an update to wọ is made. In most current learning applications the update 
is kept constant. The noise in (1) will locally perturb the convergence of the stochastic 
gradient algorithm, although, as long as the measurements are drawn from a non-degenerate 
distribution, the estimate wg converges asymptotically to a neighbourhood of ws (Solo and 
Kong, 1995). More specifically, it can be shown that under mild assumptions the stochastic 
gradient descent algorithm will follow a trajectory “close” to the trajectory of an “averaged 
equation”. The exact meaning of “close” depends on the details of the analysis (Solo and 
Kong, 1995), but the important point is that they are closer as the step size gets smaller and 
the noise gets smaller. Thus it is common practice in analysing or developing stochastic 
gradient descent algorithms to pretend at first there is no noise present, but afterwards 
check performance with noise. This is exactly the route we will adopt in the present paper. 
Thus we we assume that the measurement 


Yk = (We, Lk) 


is unperturbed by noise. This assumption simplifies the presentation of the principal contri- 
bution of the paper: a geometric interpretation of a preferential structure on the parameter 
space. 

Recent developments in statistical geometry (Amari, 1985, Murray and Rice, 1993, 
Barndorff-Nielsen, 1988, Douglas and Amari, 2000) are based on providing a geometric 
interpretation of the problem of parametric statistical inference, the problem of computing 
one model among a parametrized set of statistical models which best describes an observed 
set of noisy measurements. The underlying paradigm is to find an intrinsic geometry for the 
problem (based on statistical precepts) which is independent of the particular parametriza- 
tion of the statistical model. The key developments in statistical geometry revolve around a 
geometry on the space of conditional probabilities derived from a likelihood function and the 
associated affine action of the set of random variables acting on the space of probabilities. 
The Riemannian metric used is typically the Fisher metric (also known as the maximally 
non-informative metric). This metric is derived from cross-correlation of random variables 
and leads to an associated prior distribution known as the Jeffery prior or maximally non- 
informative prior. The more sophisticated geometry developed (involving parallel transport 
and affine connections) is related to invariance of various statistical divergence measures 
under the action of covariant differentiation. 

In the present paper we take a different approach. We assume that there is significant 
prior information available and that this information is coded directly in terms of the 
weight vectors for the learning problem. As a conseqeunce, the given parametrization (in 
this case the linear weight vector) of the problem contains important information. It is 
clear that the maximally non-informative geometry (generated by the Fisher metric) is not 


314 


PRIOR KNOWLEDGE AND GRADIENT DESCENT LEARNING ALGORITHMS 


a suitable structure for analysis of the learning problem considered. To further simplify 
the development we restrict our analysis to the deterministic learning problem where the 
only statistical information that needs to be considered is the prior information. This 
perspective on the problem considered is quite different from that proposed by Amari (1998) 
and appears to provide a means to understand a number of known learning algorithms in a 
generic manner. 

Variations on the classical gradient descent or LMS algorithm (5) have been recently pro- 
posed as prototype learning algorithms. Of particular interest is the exponentiated gradient 
(EG) algorithm (Kivinen and Warmuth, 1997) 


A OL p 
wk+1 = diag(wg) exp (= Sk y H Îk)), (6) 


where the exponential function exp. of a vector is the vector of the exponentials of the 
separate entries, and diag(w,) is the diagonal matrix with diagonal entries given by the 
entries of wg. Thus the ith entry of wz41 is given by 


YA a OL ^ 
Wk41 = Wk CXP Sky 7 (Yeo Dk) : 


It is known that the EG algorithm performs better than the GD algorithm if the true 
parameter w, contains relatively few non-zero entries (Kivinen and Warmuth, 1997). It 
is not surprising that it performs worse than the GD algorithm if the true parameter has 
many non-zero entries. Thus, to choose the most efficient learning algorithm for a given 
application one would exploit any prior knowledge available regarding the nature of the true 
weight vector to choose between the GD or the EG algorithm. 

The above discussion leads one to pose the question: How is it possible to use prior 
knowledge about the true weight vector in a given application to design a more efficient 
learning algorithm? The remainder of the paper is devoted to presenting an approach to 
answering this question. Before entering into the technicalities it is worth mentioning how 
the proposed results may be used in practice. For a real world problem involving a class of 
linear model relationships it is often very difficult to understand and model prior information 
based on physical arguments. However, it is generally a simple, if time consuming, matter to 
acquire data in real world operating environments. An estimate of the distribution of true 
weight vectors can then be inferred from the data by running a standard gradient descent 
algorithm and recording the weights it converges to and then estimating a distribution 
over those weights. The experimental prior distribution can then be used as the basis of 
an optimized design using the theory presented in the sequel. In particular, if the prior 
distribution obtained is a product distribution the preferential structure is given by (13) 
and the algorithm is given by (40) or an approximation of this equation. An example of this 
determination and use of an “empirical prior” has been presented by Martin et al. (2001). 


3. Preferential Structures and Prior Knowledge 


In this section an approach for encoding prior knowledge into learning algorithms by im- 
posing a Riemannian geometry on parameter space is proposed. In the context of learning 
theory we propose to call this geometric structure a preferential structure. 
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3.1 Riemannian Metrics 


A Riemannian metric on RY is a bilinear, positive definite inner product on each tangent 
space Tou R” = R which varies smoothly in w. We denote a metric by 


CTR SP RY => RY 


that may be represented explicitly in the natural co-ordinates on RY by a positive definite 
matrix Gw > 0 at each point. Thus, for tangent vectors X,Y € Tu RV 


(X, Yw = XAG GY, 


and Gw > 0 is a smooth matrix function on RX. At the point w € R the metric can be 
thought of as a way to measure the length of vectors and angles between vectors in T,,R%. 

A Riemannian metric (-,-),, on each T,,R% can be used to measure curve length on RN. 
Let y : [0,1] > RY be a smooth curve on RY. Then the length of y is defined to be 


1 
Ly) = | (ADAM andr: (7) 


This extends to a classical metric 6(u, w) measuring distance between two points u, w € RY 
via the infimum 


iuw) i= int LO), (8) 


where 
Hu, w) := {7 : [0,1] > R” : 7(0) = u, yA) = w}. 


That is, 6(u,w) is the length of the shortest curve connecting u and w. To avoid confu- 
sion between Euclidean RY we will use RY to denote R equipped with a non-Euclidean 
geometric structure. 


3.2 Encoding Prior Information 


By a linearization locally around any point w(0), a Riemannian metric may be written 
Gz := diag(uñ, --- uy) + O(||2 — wll) 


for z € U a neighbourhood of w and where u; > 0. Choosing two end points w(0) and 

w(1) that only vary in the ith component it is easily verified that the shortest length 

curve between these points is the straight line lying along the co-ordinate axis connecting 

them. The length of a curve lying along a co-ordinate axis w' is simply ju;|w’(0) — wt (1)| = 

6(w(0),w(1)). Thus, taking a unit length step in direction wê with respect to the new 
1 


geometry translates into a scaled step of length T in the original co-ordinates. That is 


6(w(0),w(1)) = mlw’ (0) — w0)|=1 & w0) wo — 
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Suppose now that one has some prior knowledge that indicates w' is likely to be a 
fairly good estimate of the ith component of the true parameter whereas w/ may be a 
poor estimate. Then we may choose the metric G, with u > pj so that a unit step 
(with respect to the new metric) in direction w’ results in a relatively small change in the 
Euclidean distance while a unit step in direction w results in a significant change Euclidean 
distance." Thus even if the instantaneous cost indicates large changes should be made in the 
direction wÎ (perhaps due to noisy data), the prior knowledge (in the form of the chosen 
metric) would ensure that only small steps (relative to the Euclidean metric) are made. 
Intuitively, if our prior knowledge is good and can be coded in this manner then a learning 
algorithm derived with respect to this new geometric structure should perform better than 
one which does not incorporate the prior knowledge in any manner. The insight provided 
by this example is directly applicable to infinitesimal learning steps at a point w € RY since 
Gw is symmetric and can always be diagonalized locally (to O(w) terms in a neighbourhood 
of w). In Section 5 we show how to generate practical learning algorithms that respect the 
geometric structure generated by an arbitrary Riemannian metric. 


Definition 1 Consider a learning problem of the form outlined in Section 2. A preferential 
structure is a Riemannian metric on parameter space, called the preferential metric, that 
encodes certain prior knowledge for the learning problem. A learning problem along with a 
preferential structure is said to have a preferentially structured parameter space. 


This definition is clearly inadequate (so far) as a technical tool since it does not provide 
any quantitative manner to generate a preferential structure. In Section 4 a connection is 
drawn between product prior distributions and diagonal preferential structures that provides 
a quantitative connection for a large class of interesting problems. 


Remark 2 The definition of preferential structure proposed (Definition 1) makes no ez- 
plicit reference to the underlying statistics of the problem (e.g. via the likelihood function). 
The information geometric structure presented by Amari (1985), Murray and Rice (1993), 
Barndorff-Nielsen (1988) satisfies Definition 1 given prior knowledge of the noise charac- 
teristics of the measurements and no prior distribution on the weights (Amari, 1998). The 
class of algorithms considered in the present paper follows from the assumption of deter- 
ministic measurements and a prior distribution on the weights. 


4. Product Distributions and Diagonal Preferential Metrics 


In this section we consider the situation when prior knowledge is quantified via a Bayesian 
prior probability distribution. There are numerous arguments (Robert, 1994) why this is 
a “good” way of encoding prior knowledge. Our goal in the present section is to relate a 
prior probability distribution over the parameter space to a preferential metric. 

We focus on the particular case where the prior is a product distribution in which case (as 
we show) there is a unique preferential metric that naturally recodes the prior distribution 
into a preferential structure. The actual application of these structures to gradient descent 
learning algorithms is considered in the next section. 





1. An illustrative example of this effect is given in Figures 7 and 8 for the particular geometry of the EG 
algorithm. 
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Suppose a prior distribution (with density 6: RN > R+) for the true parameter w, is 
given and has the form 


N 9 
olw) = | [ auh), (9) 
121 


where each ¢; : R > R+ is itself a probability density for wê. Thus given a set Q C R” then 
the probability that w, € 9 


Pie | danan: (10) 
Q 
Now suppose there exists a preferential metric (represented by Gw) such that 


det(Gw) = o(w)?. (11) 


Once again we can take a set Q C R” (Lesbegue measurable) and compute the area of Q 
with respect to the preferential metric. The area is given by the integral (Boothby, 1986, 
pg. 240) 


A (Q) = [ EET (12) 


(the p denotes “preferential”.) That is, the volume element in the new geometry is scaled 
by the factor \/det(G,,) with respect to Lesbegue integration in the co-ordinates w. Thus, 
due to the assumed form of Gy, 


A,(Q) = | swt = P(w, € Q). 


In this sense area with respect to the preferential metric is equivalent to density with respect 
to the p.d.f. 9. 

If ¢ is large in a region then the associated metric should also be large, corresponding 
to large relative area of the region with respect to the preferential structure. Consequently, 
unit step updates (with respect to the preferential structure) in a gradient descent learning 
algorithm should translate into small updates of the parameters w. For example, even if the 
instantaneous cost indicates large changes should be made to the present estimates (perhaps 
due to noisy data), the prior knowledge (in the form of the preferential structure) ensures 
that only small steps (relative to the Euclidean metric) are made in areas corresponding 
to uniformly high p.d.f. 9. If the actual true parameter is such that it causes the descent 
steps to continue to force a change in an unlikely direction with respect to the preferential 
structure then convergence of the parameter wk — w, will be considerably slower than if 
the preferential structure was not present. This corresponds to having made the wrong 
prior assumptions about wx. 

We will show below (section 9) that the above intuitive justification is sound in a more 
precise sense: modification of a gradient descent algorithms by a preferential metric satis- 
fying (11) leads to an algorithm with faster local convergence. 
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4.1 Freedom in choosing Gu from ¢ 


Equation 11 will be satisfied by arbitrarily many Riemannian metrics for a given p.d.f. ¢. 
However, for a product distribution (9) we propose the particular preferential metric 


p(w)? Q aut 0 
Gy i= ee : (13) 
; i 0 
0 ae 0 bn (wA)? 


Certainly Gw satisfies (11). Moreover, it generalizes the product distribution structure of 
@. To see this one must consider the probability of sets (events) Q which have zero measure 
in RV. We will restrict our attention to sets Q which are embedded submanifolds of RY. In 
general, let M be an embedded submanifold of RY and let Q C M be a subset of M. With 
respect to the p.d.f. ¢ on RY then for any event Q of this form P(Q) = 0. However, the 
conditional probability P(Q|M) should not be zero. Moreover, this should be related to an 
area integral on M relative to the geometric structure inherited on M via the embedding 
M — RN from the preferential structure. 

We will say that a preferential structure agrees with a prior distribution @¢ if for any 
embedded submanifold of RN then 


P(Q\M) = Ay (N) 


where 


AM (Q) = | yaad 


where GM is the metric projected onto the manifold M; see (15) below. If this is true then 
the concept of length defined by the preferential metric (which is important for generat- 
ing learning algorithms) agrees with conditional probability distributions in the particular 
case of 1-dimensional subspaces. Computing conditional probabilities for arbitrary lower 
dimensional sets is difficult and we shamelessly dodge this problem in the next lemma by 
restricting our attention to embedded manifolds lying orthogonal to the co-ordinate axis 
and exploiting the structure of the product measure. 


Lemma 3 Let ¢ be a product p.d.f. of the form (9) and let Gy be the preferential metric 
given by (13). Then for any embedded manifold M — RN which is orthogonal to a subset 
of the co-ordinate azes and any subset Q C M one has 


P(Q\|M) = A¥ (9). (14) 
Furthermore, Gy is the unique metric for which this identity holds. 


Proof If an m-dimensional manifold M © R” is orthogonal to the co-ordinate axes 
then locally one can choose m co-ordinates z = (w,... ,w’)? € R™ which act as local 
co-ordinates on M. Note that M is characterised locally by holding the remaining N — m 
co-ordinates constant (Boothby, 1986, pg. 75). Due to the nature of the product distribution 
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the conditional p.d.f. on M is given by integrating out 0 with respect to the N — m ‘non- 
active’ co-ordinates to obtain 


O(a yw) = II di, (w) 
k=1 


Thus, written in terms of the local co-ordinates z the probability of Q conditioned on M is 
the Lesbegue integral 


P(Q|M) = f oaz 


We deliberately avoid the rigorous approach to dealing with conditional probabilties when 
the conditioning event has probability is zero; see for example (Shiryaev, 1996, Section II.7) 
or (Hoffmann-Jorgensen, 1994, Chapter 10). Chang and Pollard (1997) have presented an 
alternative formal approach. 

Think of the local co-ordinates as maps into R via the correspondence w’ : R > RY, 


wezi): (0,... ,0,2*,0,... ,0)7 
where z* occurs in the ith entry of the vector. Thus, the local co-ordinate map around a 


point wo E€ M can be written as a vector function wy : R” > RN 


wae a) = wi (24) +o H win (2) + wo, 


We write wm = (wh,,..., wh)? € RY for the co-ordinates of wy € RN. The induced 
preferential metric on M with respect to the local co-ordinates is 
pi (z+)? (eee 0 
Cait Gag = 9 (15) 
: n 0 
0 ae 0) bin (2)? 
P 
Here dwm E€ RVYT with entries (dwm)pq = Mi p=1,...,N,q=1,...,m. Now observe 


that 
det(G™) = II pir l = (2): 
k=1 


It follows from the definition of area on a Riemannian manifold (12) that P(Q|M) = AM (Q). 

To show uniqueness assume that Gwu > 0 (with entries gij) is an arbitrary preferential 
structure that satisfies (14). Observe firstly that the diagonal elements of Gw are uniquely 
defined by considering the case of one-dimensional manifolds orthogonal to the co-ordinate 
axis. This follows since the induced preferential metric is just the diagonal element gi; while 
the conditional probability is ¢™ (w’). 

Consider an arbitrary two-dimensional submanifold orthogonal to the co-ordinate axis 
with local co-ordinates wê and wf. Then 


eee Jij ji 
Iji Jjj 
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To ensure that (14) holds on all subsets Q then locally on M one must have ¢™ (wt, wł)? = 
det(GM) where PM (wt, wi) = ¢;(w')¢;(w’) is the product measure induced on the subspace 
spanned by the 7 and j co-ordinates. Computing the determinant yields 


det(Ge) = gügjj — 915954 
= $;(w")?b;(w’)? — g3 


= o” (w, w)? — gij 


where we have used symmetry of Gw as well as the identification of the diagonal elements 
with $;(w’)?. Since M and Q are arbitrary it follows that gij = 0 for all i Æ j. E 


Remark 4 It should be noted that except for some changes in notation the development 
undertaken above is valid for any p.d.f. of the form 


N 
ow) = || di(w) 
i=1 


where each ġi(w) is a probability distribution for wt on R but may depend on all the variables 
w. This is important since product measures ¢;(w) := ¢;(w*) induce a preferential structure 
which is isometric to Euclidean space (cf. Section 7) whereas a more general product distri- 
bution, even though it induces a diagonal preferential metric, will not induce a Euclidean 
preferential structure. 


4.2 General Improper Priors 


In this subsection we explore the implications of ¢(w) being an improper distribution (i.e. 
one for which fo ¢(w)dw # 1 and may in fact be infinite). There are two key reasons for 
doing this: 


1. The two main example algorithms we consider (standard Gradient Descent where 
(w) = 1 and the EG algorithm where ¢(w) = 1/|w|) correspond to improper priors; 


2. In practice only a local comparison between performance of stochastic gradient descent 
learning algorithms is possible (a global comparison seems analytically intractable; cf. 
Section 9) and consequently only conditional probabilities, in a neighbourhood of a 
given point are considered. Thus, the question of improper or proper priors does not 
truly affect the conclusions of the paper. 


The above development has been undertaken for proper probability distributions @ and 
has looked for Riemannian metrics which correspond to these probability distributions. 
However, one may look at the problem in the opposite direction and ask the question, given 
a Riemannian metric g(w), then does the distribution 


p(w) := vdet(g(w)) (16) 
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mean anything in a probabilistic sense? Below we will argue that it does. Note that for 
an arbitrary Riemannian metric Gw the distribution ¢ generated is not in general a proper 
probseeability distribution; it is highly unlikely that 


A= V Gydw = 1. 
RN 


It can be interpreted as an improper distribution. Bayesian statisticians often deal with im- 
proper prior distributions; the most obvious one is the uniform measure over R. Whilst there 
are undoubtably technical difficulties arising from such improper priors (see e.g. Robert, 
1994), one can generally get away with their use as long as the posterior distributions are 
well defined. Furthermore, it is possible to reformulate the basis of Bayesian statistics 
to handle improper priors in a rigorous manner (Hartigan, 1983) albeit at the expense of 
considerable technicalities. 

Of course if A is finite then simply rescaling Gw provides a direct analogy of the de- 
velopment given above. However if A is infinite then a different argument is needed. If 
one wanted to compare the one event Q C RY with respect to a uniform distribution (an 
improper distribution on R) or a non-uniform distribution ¢ then one could look at the 
ratio 


w)dw 
Rig) (Q) = bant 


Thus, R(¢ı,%2)(®) is a ratio of likelihoods rather than an absolute probability. As long as 
the events considered always have finite non-zero weight with respect to the set Q, then 
the ratio Rig, 42)(Q) is well defined. We will call distributions 9 of the form (16) relative 
probability distributions. Such a distribution should always be thought of relative to a second 
distribution; in the sequel unless otherwise mentioned the relative distribution will always 
be the uniform distribution. Thus, in this context the ratio of @(w)/1 can be thought of as 
the relative likelihood of the point w compared to the uniform distribution. We shall see 
in subsection 7.1 that rescaling ¢(w) by multiplication by a scalar is equivalent to a simple 
change in the step size used. 

The machinery associated with improper distributions and relative probability distri- 
butions is natural in the analysis undertaken in this paper since the basic GD learning 
algorithm can be thought of a being associated with the uniform prior distribution. This 
can be seen heuristically by observing that the direction of update for the GD algorithm is 
totally unbiased by any modification of the direct steepest descent direction. In a proba- 
bilistic sense this is saying that w, € RY is equally likely to be any point in RV and hence 
the best estimate of w, is based on minimizing the loss associated with the latest received 
data. By contrast, the EG algorithm update step is approximately taken in the direction 
of steepest descent for very small update steps, but the larger the update step then the 
more the distortion of the exponential tends to change the next estimate. We interpret this 
as taking account of prior information in the update step. (An explicit form for the prior 
associated with the EG algorithm is determined in section 6.2). 

In the sequel we will always be dealing with the performance of a given algorithm relative 
to the performance of some other algorithm. We will relate this relative performance back to 
relative probability densities and then to certain associated Riemannian metrics. Since the 
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entire analysis is relative we will choose to adopt the following premise:? The GD algorithm 
is the “best” or most approrpriate learning algorithm given that one has no prior knowledge 
of the true parameter. The GD algorithm is associated with the uniform improper prior and 
the Euclidean preferential metric. 


5. Learning Algorithms on Preferentially Structured Parameter Space 


In this section a method for generating learning algorithms is proposed based on a given 
preferential structure. 

Both the GD and EG learning algorithms (5) and (6) are stochastic gradient descent 
algorithms and use only first order differential information of £ and a step size (learning 
rate) są at each update. Consider a general learning algorithm of the form 


OL 
Wk+1 = F (sx, Fu Vr) (17) 





where F : R x RY x RY — RN. To make the learning algorithm sensible one would expect 
that if either są = 0 or SE (Uk, Ok) = 0 then F(szg, 3E wp) = wę and that F is a locally 
continuous (or even differentiable) function of its arguments. If F is differentiable in the 
first variable then 


Z OL 
I(T) = F(7, Ja wk)» 


is a C! curve in RV which passes through wg for r = 0. This leads one to study the class 
of curves 


Y(wr,Ve) : [0, Sk] T RN 


such that Yaw, vp) (0) = Wk, wp, vp) (0) = Vk and Vp is a function of the derivative information 
oe 

Ignoring for the moment the question of how to choose są and Vk, then one may ask 
exactly what is the best “curve” 7, y,)(-) to choose given a known preferential structure. 
For stochastic gradient descent algorithms, the aim is to converge as fast as possible to a 
neighbourhood of w, and then stay there. Setting wk+1 = (wy,Ve) (Sk) for fixed są and Vk 


then one would like to maximise the distance 
O( We, Wk+1) 


taken at every step (cf. (8)) measured relative to the preferential structure! 





2. The term “best” in this definition is certainly dependent on the context of what constitutes a learning 
algorithm. Certainly if one were to consider simply the optimization problem of minimizing the loss over 
all target parameters then some more sophisticated update method (for example a Newton update) would 
tend to display better performance. Here we will restrict ‘learning algorithms’ to be effectively the class 
of algorithms which we generate in the sequel. Though this may appear to be a circular definition it is 
effectively the set of linear descent algorithms with respect to general Riemannian geometry. Only direct 
first order derivatives of the cost are used to generate the update information. Further modification of 
the update step is entirely due to prior information. The key point is not which sense GD is best in, but 
rather how we can modify GD simply taking it as the starting point — the appropriate algorithm for a 
uniform prior. 
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Given that the vectors Vą and the scaling factors są are chosen together to guarantee 
the learning algorithm is well behaved (for example in the sense that the sequence {w;} will 
converge to wx for reasonable data samples) then the curve y should be chosen to maximise 
the distance traveled in the ‘direction’ Vp. By measuring distance relative to the preferential 
structure, prior information is directly incorporated into the update step. 

To derive a curve that generates an efficient learning algorithm it is important that the 
length of the curve is directly related to the step size są and to the size of the vector Vp. 
This is natural for the step size since it is the path length parameter of the curve. However 
to ensure that length of the update curve is properly related to the size of the vector Vi 
then it is necessary to further require that the update curve y evolves at a constant speed 
with respect to the preferential metric: 





J (17), 107)) = (Vier Vidor 7 2 0. (18) 


By this argument, the ‘best’ curve to choose for the purpose of generating a learning algo- 
rithm is one that satisfies (18) whilst maximising (wg, Wg+1) for given s and Vg. Thinking 
of the question in reverse, then given two points wz and wk+1 one is searching for a curve 
y of minimum length and constant velocity (with respect to the preferential structure) that 
connects the two points. Such length minimizing curves on a general Riemannian manifold 
are known as geodesics and are the analogues of straight lines in Euclidean space. 


5.1 Geodesics 


Geodesic curves on a Riemannian manifold can be defined as the solution of a ODE (Or- 
dinary Differential Equation) in local co-ordinates which essentially ensures straightness of 
the solution curve with respect to the metric (Lee, 1997, pg. 68). 

Denote the ijth entry of Gw by gi; and the ijth entry of G3! by g’? where the base 
point w of gij (resp. 97) is inferred from context. Define 


N 
ot ks [Jsi Jij 9955 
IA Sh 1 
T 2 I ðwİi Ow’ j ðw? ty) 


s=1 





The functions r$, are known as the Christoffel symbols (Boothby, 1986, Lee, 1997). A 
geodesic curve y := Y(w,,Vv,)(7) satisfies the set of coupled second order ODEs (Lee, 1997, 


pg. 58) 





da D a d dy! 
yA =0 20 
dr? 2 ‘I dr dr , (20) 
with initial conditions 
dy 
0) = —=YV,. 
y( ) Wk, dr k 


Uniqueness and well definedness (at least for small 7) follows from the classical theory of 
ODEs. 
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Generating the geodesic requires knowledge of tangent direction Vg. It seems natural to 
choose V; equal to the derivative oe Formally, however, this derivative is not actually an 


element of the tangent space of RY. Rather, D,,£ = ( ab)? is the differential of £ and is 
a row vector (co-tangent vector) and not a column vector (tangent vector). There is a one 
to one correspondence between tangent vectors V € T RY ~ RX! and cotangent vectors 
W e TRN ~ R!*N induced by the Riemannian metric via the unique correspondence of 


linear maps 
(V,X)y=WX, for any X € TRY ~ Rò”! 


For W = D,,4, the corresponding tangent vector is known as the gradient and is given in 

local co-ordinates by 
OL 
radl = G71. 21 

g © Du (21) 

When the metric is simply the identity matrix then one obtains the classical Euclidean 

gradient grad = oe As Amari (1997, 1998) has shown (in a slightly different setting) 

there are advantages to using Vk = —grad& where the gradient is taken with respect to the 

preferential structure. Of course the negative Euclidean gradient -9E will always provide 

a descent direction for the cost £ and will generate a sensible learning algorithm. Indeed, 


for any positive definite matrix Q > 0 


196 


V; = -Q7 
k ðw’ 


generates a descent direction. The general form of the learning algorithms studied in the 
remainder of the paper may now be presented. 

Let £ be an instantaneous loss function associated with the learning problem given in 
Section 2. Let Gy be a preferential metric and let sẹ be a sequence of scalars which can be 
thought of as the effective learning rate. Then the learning algorithm studied is given by 





Wk+1 = YVwr, —grad£) (Sk); (22) 


where Yw, graa) İS a geodesic curve with respect to the preferential structure (confer (20)). 


6. Preferential Structures for Two Common Learning Algorithms 


In this section some common learning algorithms are analyzed in terms of preferential struc- 
tures. The gradient descent (GD) algorithm has an interpretation based on the standard 
Euclidean (or unbiased) preferential structure on R” while the exponential gradient algo- 
rithm is related to a preferential metric of the form (13). 


6.1 Gradient Descent Algorithm 


Consider the un-biased or Euclidean preferential structure given by 


Gw = In 
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the Euclidean metric. This metric is associated with a uniform (improper) prior product 
distribution. The Christoffel symbols are zero, since the metric entries are constant, and 
geodesics are given by solutions 


din 


which are of course just straight lines 
Ya = We + TVp. 


The gradient of the loss £ is simply grad£ = ÎE Thus, comparing with (5) it is easily 
verified that the GD algorithm is simply 


Wk+1 = Vane (Sk) 
OL $ 
= Wk — Ska (Yas Îr). 
W 


6.2 Exponentiated Gradient Algorithm 


In this subsection the EG (Exponentiated Gradient) algorithm is derived within the frame- 
work of preferential structures. 

Previous work (Kivinen and Warmuth, 1997, Hill and Williamson, 2001) has shown that 
the EG algorithm tends to perform well in the situation where only a few entries of the true 
parameter w, are non-zero. Considering the question in reverse one would heuristically wish 
to choose a preferential structure that emphasises regions where only a few co-ordinates are 
non-zero. We will choose such a structure and show the EG algorithm (almost) follows from 
such a choice. 

Let RY denote the positive cone in RY 


RY = {ue R^: u>O}. (24) 


Consider the following preferential metric defined on RX 





1 
mt 0 0 
al Ta (25) 
0 0 Ce 


According to Section 4.2 this metric is associated with an improper product distribution on 
RY of the form 


N 
pw) = [] ow), 
i=l 


where each ¢;(w) is given by 
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Thus, each component prior distribution is weighted more heavily near wê = 0 and thus 
the overall product density is weighted strongly near w = 0. (Some graphs illustrating this 
metric are given in Section A.) 

The diagonal structure of the preferential metric Gwu makes it particularly easy to com- 
pute the Christoffel symbols. Recalling (19) the Christoffel symbols are 


re. = ifi=j=p. 
1I 0 otherwise. 


Finally, recalling (20) the equation for the geodesic is 


d?-yP 1 {dy 2 
= = (+) =0, forp=1,...,N, (26) 





with initial conditions y(0) = w and 4 = V for arbitrary w € RN and V € T,,R%. The 
particular structure of the Riemannian metric ensures that the N second order ODEs for 
the co-ordinates of y are decoupled. It is easily verified by substitution that the solution of 
this equation for each co-ordinate is 


i i T yi 
y'(7) = w" exp (Zv ) ; (27) 
Consider the descent direction 
Vk = — diag (wi) > (Yes Jk) = -Gy aw’ (28) 


where wy denotes the kth iteration of the learning algorithm. Note that Vk #4 —grad£, with 
respect to the preferential structure Gw, however, V; is certainly a descent direction since 
diag(w) > 0 is a positive definite matrix for w € RY. Along with the geodesic equation 
obtained above for the preferential structure chosen this choice of descent direction in (22) 
leads to the EG algorithm (6) 


d oL z 
Wk+1 = Vw, —diag(w,) 22) (Sk) = diag(wk) exp. | —Sk5—(Ye, Ye) ) = (6). 
(wy. —diag (wx) 2) Dw 


Although Vk is not actually the negative gradient —gradZ it is, however, closely related. 
According to the development undertaken in Section 5 it may be preferable to choose 


Wk+1 = YV(wp,—gradL) (sk) i 


which would be equivalent to choosing 


ði 
Ow’ 


This choice results in a the learning algorithm that we will call the ‘natural EG algorithm’ 


Vk = —grad£ = -G,, 





‘ r OL ii 
Ukta = Vou-me) (si) = diag( wn) exp ( =srdiag (on) 22099). 
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This raises the guestion: what is the effective difference between (6) and (29), or is one 
to be preferred somehow. One remark we can offer in this regard is that (29) might be 
preferred in so far as it allows a comparison with other gradient algorithms which use the 
natural gradient and thus differ only in the choice of 4. It is also apparent by inspecting 
the two algorithms and their behaviour in simulation that (29) pays more attention to the 
prior. 


7. Link Functions and Flat Preferential Structures 


In this section the general properties of product preferential structures are studied. It is 
shown that the link function analysis used in recent literature (Kivinen and Warmuth, 
1998) to analyse the EG algorithm can be obtained as a direct generalization of normal 
co-ordinates with respect to the preferential structure (25). 

Consider a product preferential metric of the form 


pı (wt)? 0 bets 0 
Gal WA - (30) 
0 0 du(w)? 


where ¢; : R — R are positive definite functions ¢; > 0. Note that each ¢; := ¢;(w’) 
is chosen to depend only on its associated variable. This is the situation where Gw is a 
preferential metric associated with a product prior 


N x 
olw) = |] oi(w’). 
121 


This structure has some special conseguences for the geometry of learning algorithms derived 
according to the procedure outlined in Section 5. To distinguish between the preferential ge- 
ometry and the classical Euclidean geometry on RY we denote the preferentially structured 
space by RN. 

Denote the ijth entry of Gw by gij and the ijth entry of GZ! by g’? where the base 
point w is inferred from context. Thus, gi; = 0 = g” except when i = j and 


1 
ilw)? j 
Recalling (19) the Christoffel symbols are easily verified to be 


gi = bi(w')?, g“ = 


pa) gaga WA (31) 
a 0 otherwise. 
Thus the Christoffel symbols for a product preferential metric always have a diagonal struc- 


ture. Recalling (20) the general equation for a geodesic curve, then it is easily shown that 
the general geodesic equation in this case is a set of N decoupled second order ODEs 


L abi) (an? 
ey a 0) = 0 





(32) 
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As a consequence the general geodesics are made up of independent evolution equations in 
each of the co-ordinates. 

Equation 32 can be simplified to obtain a set of simple first order, single variable ODEs. 
Note that 


abil") 55 _ ddl) 


dyi a h 





where we now think of ¢;(t) = ¢;(7*(t)) as a function of t. Thus, the above equation for 
the geodesic becomes 


ht 
#4 C4 =o. 
YA 


Let 2" = $7 (ie. 2*(t) = pil (EY (2). Then 
Z = Qiy + oi = Qi (Ss +x) =0, 


where the factorization by 9; is always possible since the ¢; are positive definite functions. 
Substituting back into the definition of z’ and rearranging terms yields the ODE 





a = 
2(0) = ¢i(7*(0))¥'(0); (0) = u’; (0) = V’ (34) 


Lemma 5 Suppose ¢;(w) > 0 for allw E R. Then the solution of (33,34) is 
FE) = Dr (V Bafu) (35) 
where ®; = f ¢; (the indefinite integral of $). 


Proof We write ¢ = ¢; for simplicity. The condition on ¢ implies ® is invertible and 
thus (35) implies 


(10) = VV o(w) + P(w). 


Differentiating both sides we obtain 








000) ao) wo) (36) 
= AADA = Volw) (37) 
Volw) _ 6(0))400) 
AA E CTC) i 
Furthermore 4(0) = ®-!(®(w)) = w and 4(0) = $40 =v E 
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One can readily check (by replacing ®(w) by ®(w) + c and 07!(r) by ®-!(x —c)) that 
the constant of integration c effectively omitted in the definition of P does not change the 
solution y(t). 

By setting V = —grad&£, and since 


OL OL 
ee Oleg) eee F —2 1 —2 NIA 
grad£ = Go ðw diag(¢} (w ), sen On (w )) ðw 


(22) takes the general form 


j OL 
Wh = ®; : (a Ow? 








wt 


= +a) (39) 


wi =u} Qi (wy) 





When £ is the squared loss (4), 2& = —at.(y, — J) one obtains 


> Jwi 





spat —Y ; 
uja = 077 (BEE 0) 





di (wy,) (40) 
7.1 Rescaling of ¢(w) 
As a sanity check, now consider what happens if we replace ¢(w) by 
p(w) = Bow) (41) 
for some 8 > 0. Clearly we have 
O(w) = p(w) (42) 
and 
$- (x) = #7! (x/8) (43) 


Substituting (41), (42) and (43) into (40) we see that we recover the orginial algorithm by 
setting 3, = 8?sp. This makes sense when one recalls (see (30)): clearly Gy = 6?Gy. 

This simple analysis shows there is no intrinsic reason to insist that the distributions 
@(w) are normalized as proper probability distributions. 


8. Examples of Possible New Learning Algorithms 


In this section we examine some examples of learning algorithms generated by certain 
specific preferential structures. In all cases the underlying learning problem is that presented 
in Section 5 (cf. (22)). The algorithms are in fact simply (40) for different choices of ọ(w). 
The resulting ¢, 6 and 07! are collected together in table 1. 
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(OP) WIYWIOZTe 107 , 0 PUL GA Jo sootoyp ofqissod owog :T ALL 



































70 0G 
(uf /x0Z) ,_J40 (mo) ji oe zng? ko 
0 0 
(|x|70 = Du (aJu3s (Injo-? —_— 1) (mjuss 0 < 2% |m|o—? (70) dxo 
oe), L e/e(em + T) Atpneg 
m T W/€ 
(x)uey (m)ueyore wel Ayoned 
T 
M+ 
(x) yurs (m)qutsore aa z/ Ne 
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(%) 7-6 (n)a SUOT}PUOD (njo wy OST y 








331 


MAHONY AND WILLIAMSON 


8.1 EG (natural) 
Choose ¢(w) = 1/w and thus from (40) and Table 1 one obtains the algorithm: 
Weyer = exp (Ser (Ye — Pr) /wh + in(wh)) 


wi, exp (seri (YR oe Gn) /Wh) 


which is equivalent to (29) with £ being squared loss. 

This is the EG algorithm utilizing the natural gradient. It is only valid for w > 0. In 
order to use this algorithm to learn targets u that are not componentwise sign definite, the 
+ trick as presented by Kivinen and Warmuth (1997) could be used. 





8.2 EG(a) 
Choose ¢(w) = 1/w® (a Æ 1) and thus from (40) and Table 1 one obtains the algorithm: 





A Ai x 4 l-a 
uka = Po (sett — de) (nk) + (hy) (44) 
Like the EG (natural) algorithm, this algorithm is only valid for wi, > 0. It is easily verified 
that this algorithm approaches the behaviour of the GD algorithm as a — 0. 

Numerically implementing the algorithm it turns out that weights can tunnel through 
the infinite barrier at the origin due to the non-infintesimal step size used. In order to avoid 
problems which this causes (more likely for larger values of s), we have found it necessary 
to modify the algorithm to 


tk sgn(wp) |wh A + (wilay — Ge) 


sen (ta) |t| VUTA. 


i 
Wk+1 


Here € is some small number; we have used e = 2 x 107. 


8.3 EGclipped(c) 


In order to avoid the difficulties of the singularity at the origin with dpc, one can simply 
clip ¢ to c > 0 giving 


delu) = min (6 5): (45) 


One can show that 


cw Jw] < 4 
c = Í 4 
Pe(w) { sen(w)(1+In(clw|)) fol > (46) 
5-1 ; le] st 47 
c (w) =. sgn(x)el*!—1 |x| > 1 ( ) 


Cc 
Observe that on a bounded domain, for all c < co, ¢-(w) can be made into a proper 
distribution (by rescaling). The limit as c > oo is improper though. This is analogous 
to how improper priors are sometimes treated in the Bayesian literature, as the limit of a 
sequence of proper priors (Akaike, 1980). 
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8.4 expla) 


Choose ¢a(w) = exp(—a|w]), a > 0. By considering positive and negative cases separately 
one show 








alw) = sent) In (1 — e-alul) (48) 
z!(x) = SA In(1—alz|), — |a| < (49) 


8.5 Cauchy Product Distribution 
Choose 


; 1 

Oe) = TF wip 
which is the Cauchy distribution (unnormalized). The variance of the Cauchy distribution 
is not defined and it is a classic example of a distribution with heavy tails. The product 
distribution ¢(w) := Te ¢;(w*) is a proper p.d.f. on R. Since the Cauchy distribution 
does not have a singularity at w* = 0 then the preferential structure is defined on all RY 
and there is no need to use + algorithms like those developed for the EG algorithm (Kivinen 
and Warmuth, 1997). 

From (40) and Table 1 we obtain the algorithm: 





Wey = tan (Sk2i (Yr — Ge)(L + (wh)?) + arctan(wy)) . (50) 


8.6 Elementwise Link Functions and Flat Preferential Structures 


Let E1, E2,... , Ey be the unit vectors in R^. Let Yw, v) (s) denote the particular geodesic 
curve obtained as a solution of (33) for initial conditions Y(w,v)(0) = w and ¥w,v)(0) = V. 
Then one can define a map 


f: RY | RY 
FE) = Yuk, cin) 


The map f(x) provides a set of local co-ordinates for R known as normal co-ordinates 
(Boothby, 1986). As we show below the map f is closely related to the concept of link 
functions commonly used to analyse learning algorithms such as the EG algorithm. The 
local co-ordinate frames induced are denoted 


o 
Ox' 
For any two vector fields X,Y on RY then let VxY denote the action of the Levi-Civita 


connection of X on Y (Boothby, 1986, pp. 317). Since f is constructed from solutions to 
the geodesic equations then it follows directly that 
0 


V S ES E pga N: 


= df |, (Ei), i1=1,..., N, 
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In fact, using the structure of TE it is easily verified that 


o 
L EQ) ij=1,...,N. 
det Oxd 1 , 
This property is not true of general Riemannian manifolds and is important since it implies 
that the underlying geometry of RY is effectively Euclidean. In particular, taking a straight 
line £k + SkVg on R which is a geodesic in the Euclidean geometry then f maps this to a 
geodesic 


Vs) = f(k + SkVp) 


on RV. The fact that the base point 71, need not be 0 is crucial in this relationship since it 
implies that the structure is unchanged for translations in RV. This will only occur if the 
intrinsic ‘curvature’ of RY is zero. 

A further consequence of the flatness of R is that the normal co-ordinate mapping f 
is an isometry. That is that for any two vectors X,Y € T,R then 


XTY = (X,Y) = (df X, dfY) pox) = (af X)"G pay df Y. 


Consegeuntly, the mapping f preserves the metric distance given by length of curves. Since 
f is an isometry one may as well take a standard learning algorithm on the simple Euclidean 
space R given in local Euclidean co-ordinates x € RY (cf. Subsection 6.1) 


Le41 = Tk + sk Vy (51) 


and obtain its associated learning algorithm on the desired space directly by the mapping 
Wr = f (xp). Of course the descent direction Vp is the descent direction in local co-ordinates 
x and must map via df to a chosen descent direction Vg € T f(y) RN . Thus, 


df (Vg) := Vp. 


The function f is serving the same role as the link functions commonly used to analyse and 
extend the EG algorithm (Kivinen and Warmuth, 1998). 
For example consider the EG algorithm where f = exp., the vector exponential, is the 


standard link function used. Direct computations show df = diag(w,) exp.(X) = w and 
1 


hence V, = diag(w)~!Vz. Choosing Vi, = -Gu 92 according to (28) then it follows that 
V; = 92. The learning algorithm obtained on RY = RY from (51) is 


Wey = f (2k41) = f (Ek + 5k Vg) 


= exp. (zr + seli) = diag(exp .(2:)) exp .(s%Ve) 


; OL 
= diag(w,) exp (sey): 


Comparing with (6) it is seen that this is another to way to derive the EG algorithm. 
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Remark 6 In some recent literature (Warmuth and Jagota, 1997) the concept of link func- 
tion has been developed in a co-ordinate by co-ordinate context. Thus, a link function is 
considered to be a map 


f':ROR, 


where the same function f? := g is generally used for each co-ordinate. The development 
here is closely related since the product preferential structure ensures that the normal co- 
ordinates can always be decomposed into independent co-ordinate functions ft := Fa?) 
obtained as solutions to the decoupled geodesic equations (33). The unified view presented 
in this paper provides a way of generalizing the link function results to obtain a better 
understanding of learning algorithms. 


Remark 7 Gordon (1999a, section 8.9.2) has also presented a (quite different) interpre- 
tation of algorithms such as GD and EG via Bayesian priors. He makes use of conjugate 
priors (see Robert, 1994) and his framework is rather different to ours. We have been un- 
able to draw a formal connection between his work and ours. Furthermore, the machinery of 
conjugate priors is intrinsically restricted — it can not cover the range of prior distributions 
we can deal with. Whether there is a useful connection is left as an open problem. 


9. Comparing the Performance of Stochastic Descent Learning 
Algorithms. 


In this section the relative performance of two stochastic gradient descent learning algo- 
rithms derived with respect to diagonal preferential structure (associated with product 
distributions) are compared. The algorithms are compared for a target weight vector w* 
drawn from the true product prior distribution that we assume is associated with one of 
the algorithms. 

The relative performance of two learning algorithms must be compared according to 
some criterion that is both computable and relevant to the desired qualitative behaviour of 
the algorithms. Kivinen and Warmuth (1997) proposed the ‘mistake-bounded framework’ 
for analysing the EG algorithm. A more traditional analysis based on classical LMS anal- 
ysis techniques has also been considered for the EG algorithm (Hill and Williamson, 1999, 
2001). Both approaches attempt to quantify the global performance of the algorithm con- 
sidered. In this paper generic algorithms based on the stochastic gradient descent concept 
are considered. A stochastic gradient descent algorithm may be considered to be a sequence 
of Bayesian estimates of the true weight vector conditioned on a local neighbourhood of the 
previous estimate and based on the most recent data. That is, at each step one searches 
for the parameter wz, E€ Br (wr) that best describes the latest data received. The learn- 
ing rate są of a stochastic gradient algorithm is linked directly to the radius r of the ball 
which is used to condition the new estimate. A consequence of this perspective is that one 
should only seek to compare the relative local performance of stochastic gradient descent 
algorithms. It is to be hoped, however, that for most sensible classes of prior distributions 
good local performance will translate into good global performance. This justification, along 
with the practical simplicity and ease of implementation, motivates the use of stochastic 
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gradient descent algorithms in learning applications. Conversely, if one wishes to solve the 
problem optimally then it is necessary to return to a full Bayesian analysis.? 

All stochastic gradient descent algorithms (for sufficiently small learning rate) display 
the same qualitative asymptotic convergence properties (Solo and Kong, 1995). The asymp- 
totic error may differ depending on the relative sensitivity of the schemes and the particular 
learning rate used. High asymptotic sensitivity is usually linked to a fast learning rate and 
better transient performance of an algorithm. As a consequence asymptotic error analysis 
provides a poor measure of relative performance of two stochastic gradient descent learning 
algorithms. 

A tempting comparison of the transient performance between two learning algorithms is 
to take the expectation (with respect to the given prior distribution) of the rate of decrease 
of the loss function for the two algorithms across all possible weight vectors and all target 
vectors. Unfortunately, due to the local nature of the Bayesian interpretation of stochastic 
gradient algorithms a global average will not provide a suitable comparison, although, in 
certain specific cases the global average may well indicate the desired result. In this paper 
we will compare the expected decrease of the loss function over possible measurements and 
true weight vectors locally around each point in weight space. 

Assume that samples x, are drawn from a normalized uniform iid distribution with 
density @ and that the measurements yx are deterministic functions of zk. The expected 
value of the loss function over possible samples xz is 


vw) = E$ [Llyr Dad) (52) 
= 5 Oa 
2 Jgn Yk — Yk kJatk 
= TR — We, Tk) PO) 
= sw — Wa) ES [epr] (w = Wx). 


Since ¢ is normalized uniform iid then 
Ee [epr] = In 


where Iy is the N x N identity matrix. 
Let p and q be two proper product prior distributions on a compact subset 2 c RY of 
weight space 


N N 
p(w) =] [w aw) =] ow). (53) 
i=1 i—1 
Let Gp(w) := diag((¢5)?,... , (90°) and Gafw) := diag((¢j),... , (97 )?) be diagonal 


preferential structures associated with the product distributions p and q. 
We will compare the behaviour of the averaged learning algorithms 


Wk+1 = Mar ato Gagan) OH)» (54) 





3. Even if the reader does not completely buy the above argument, we have a backup: “we have performed 
a local analysis because a global analysis seems dauntingly difficult!” . 
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derived from Eq. 22 for the metric G;(w) equal to either Gp(w) or G,(w). The step-size st 
is either sk or sh depending on the algorithm. The step-size scaling factor is chosen to scale 
the relative volume induced by each probability distribution to equal a constant 

(eae (any volB, (wr) yy VolBy (wx) 


na aa el el (55) 


where volB,(0) denotes the (Euclidean) volume of a ball of radius r > 0. For r sufficiently 





small the probability of the event By (wk) (denoting the ball with respect to the preferential 
t 


structure 7) under the relevant prior distribution is normalized 








gk (sh) 
PBR = f wedde = TT gazes) = e 
sk Wk i=l YAP 


= volB,(0) _ (sh) N i 

vQ gw) PBs: (wk). 
Thus, the step-size scaling factor is adjusted so that, based on the prior information under 
which an algorithm is derived, there is a fixed probability that the true weight vector lies 
in the set in which the next update of the algorithm is chosen. The scaling factor r which 
is associated with the uniform distribution on the compact set Q provides a useful ‘knob’ 
with which to tune the performance of a class of learning algorithms. In the simulations 
(cf. Section 10) we compare two new algorithms with standard Gradient Descent (which 
has $(w) = 1) and thus choose sw, = skin (@new(w))!/. 

The following approximation of (54) holds for sufficiently small step-size sË 


wher = We — SEG; (we) w — w) + O (PIG), 


To bound the error due to the approximation it is necessary to reduce the scalar r that 
bounds maximum step length. Note that sË x r (cf. Eq. 55) and the above equation may 
be written 





Wki1 = Wk — sk Gy (we) (wk — wx) +O (r°) ; (56) 
where the constant associated with the O(r?) term scales as supyeg NGE (wi. 
Let 
Avy (we) = Oy (wes) — owe) (57) 


denote the decrease of the averaged loss function at step k+1 with respect to the preferential 
structure G;. From (56) we obtain 


Ay; (we) = ; (we — sFGi(w) i (wk — wa) w) SIWA w)? + O(r?) 


= =s} (we — we)? Gy(w) + (we — ws) + O(r?). 





Since G;(w) = diag {(4)*} is a diagonal preferential structure then we can write 


+ O(r?) (58) 
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Theorem 8 Consider the learning problem outlined in Section 2 restricted to a compact 
subset Q C R”. Assume that the samples are chosen according to a normalized uniform 
iid process and let p be the averaged loss function (52). Let p and q be two product prior 
distributions (Eqn’s 53) for the true weight vector w, with associated diagonal preferential 
structures Gp(w) and Gg(w). Assume that supyca [G3 (w)|l2 and supyea [G7  (w)ll2 are 
bounded from above. Let Ay be defined by (57). For any point wk E Q set 


plue) = diag ( f (wh = wi Ppl) ) 
If for all wg E Q 
P(wk) < glwy) (59) 


and 


L ir (10x) Zplwn)) < det (C7 w) 3000) ? (00 


then there exists r > 0 sufficiently small such that when sk and sh are chosen to satisfy (55) 
one has 


Ein, [Adp(wr)] < Et, [Ava(we)] < 0. (61) 


Proof For each wz € Q there exists an ri(wg) > 0 sufficiently small such that both 
Et. [Avp(we)] and Et. [Aw,_(we)| are negative. Since Q is compact there exists rı := 
infwea{ri(w)} > 0. Let 


F := Ef, o — Eù, [AVa lwa)! 


=- fe SD (ai kyle) \dws. + +s DU AE plw.)dv. + O(r 2) 
p 


k 
7 7 ~~ wi, — wi) p(w, dw, r?°). 
7 3 as a) yA k — Wy) P(ws)dws + O(r*) 


To improve readability the superscripts and subscripts k are dropped in the remainder 
of the proof and the base point w = wk is used. If no other argument is specified all 
probability distributions are evaluated at the point w = wx. Let 





eyes, SES i ty 
m = i J 0 > wi oo dus 


pen (o)? 
"8p (i) 
Observe that wi; > 0 for all i. Using these definitions then define 


= 1 
-¥n(1-2) 
JET di 
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where we consider F(a) as a function of the new variables {a;}. The variables a; may in 
turn be thought of as functions of the product prior distributions 01. Note that 


F(a) = F + O(r?). 


The approach taken is to show that for a fixed q = [| bi distribution then the result 
holds for all product distributions p = || $p satisfying the theorem conditions. The set of 
all such distributions is parameterized by the variables a; > 0, i =1,... ,.N. By inspection, 
it is easily verified that 


F(a) > —oo for a4, > 0, i=1,...,N. 


Since we need consider only the set fa; > 0: i = 1,...,N} then there are no positive 
asymptotes of F(a) and consequently F(a) must have a global supremum. 

Note that u; > 0, i = 1,... ,N are the scaled variances of the wł around the arbitrary 
reference point w’. Recalling Eq. 55 one has 





jia NIEP NA p es 
p= z = = 
x PIY Pe q 


This is the constraint on the variables a; introduced by the conditioning associated with 
the step-size selection. Using this constraint one may write 


N-1 1 r N-1 
F(a) = om (1-2) tan 1-=][ a]. 
oa di Pp 

i=1 gil 
Computing the partial derivative of F(a) with respect to a; yields 

N-1 
OF(a) _ i q I= aj 
Oa; hi (a;)? PNG ai + 





Thus, the critical points of F(a) are characterized by the condition 
H Pr H 
i N 
— = UN- aj = —, t1=1,...,N. 63 
a Ap U Jan RA 


In particular, there is a unique critical point of F(a). It follows that this critical point must 
be the global maximum of F(a) on the set fa; > 0|i=1,...,N}. Evaluating the function 
F at the critical point one obtains 


Frit (w) = F(acrit) + O(r’) 





N-1 
=o wi -(N-1* + uy - +0077) 


i=1 aN aN 
= H 
N 
=X m-N—+0( *) 
121 aN 
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where Fcrit(w) denotes the dependence on the base point w € Q of each critical value of 
F. Premultiplying the constraint in (62) by 1/ |]; and evaluating at the critical point one 
obtains 





Mau (ou) p 
Tw NA aT Tes oi 


Consequently, at a critical point one has 


It follows that 


N N i 1/N TE 
eof aly = Hi q 2 
Beat ) < dH Nun (U z) (2) + O( ) 
N N 1/N 1 
=S m-N (Tl) (2) + O(r”) 
i=l i=l 


Multiplying condition (60) in the theorem statement by s, and exploiting the diagonal 
structure of the metric and the covariance matrix 2, yields 


L sgte (G7 *(wr)Ep(we)) 





1 
< sa det (G7 (we) Zp(we)) N 
N 


1 





Z yy ea eee 


a ($4) 


e) 


Using this condition the value of Feri(w) may be overbounded by 


Forit(w) < Yn (: = (2) `) + O(r?) 


Applying Eq. 59 it follows that for all w € Q there exists a r2(w) > 0 such that Ferit(w) < 0. 
Set r2 = infwea{re(w)} > 0. Choose r < min{ri,r2}. It follows that Foit(w) < 0 
Et. [AYa(wk)] < 0 for all w < 0. This completes the proof. 


5 
Boa 
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Remark 9 The set Q used in Theorem 8 need not be the full set on which the probability 
distributions p and q are defined. In practice, for e > 0 sufficiently small, one may choose 


Qp = {w E | p(w) < q(w) — € and Eq. 60 holds} 


The set Qp is then the set of target weight vectors w, on which the learning algorithm based 
on the prior distribution p outperforms that based on the prior distribution q. 


It is of interest to consider in more detail the two Conditions 59 and 60. Equation 59 
is a condition on the local value of the prior distribution. Ignoring the other conditions 
of theorem then this condition states that if the relative probability that the true weight 
vector is close to the present estimate is low then the algorithm designed according to 
the true prior distribution out performs its competitor. This should lead the algorithm to 
converge more quickly into a region in which the relative probability is comparable. If the 
relative probability that the true weight vector is close to the present estimate is high then 
nothing may be said about the relative performance of the algorithms. It should be noted 
that the reverse implication on performance is not a consequence of the theorem since the 
average descent properties of the algorithms are conditioned with respect to the true prior 
distribution. The second condition (Eq. 60) provides a link between the global and local 
properties of the distribution and the preferential structure. Condition (60) can be replaced 
by the underlying condition 


N 
N 
q DA mi) 
> 
p NN a hi 





that is a sufficient condition to ensure that Ferit is negative (for sufficiently small r > 0). 
We do not have a good interpretation of this condition. 


10. Simulations 


In this section we present some simulation results. In order to do so, it was necessary to 
decide on an appropriate way to compare the different algorithms, and in particular how to 
choose the step size parameter sk. The options available to choose s include: 


e Following Amari (1998) one may argue that the asymptotic stability of the algorithm 
is an adaquate measure of performance. Thus, any sufficiently small step-size selection 
is satisfactory. 


e Pick the “optimal” są = s according to Kivinen and Warmuth (1997). The way they 
do this depends on knowledge of the sequence of examples and the true weight vector. 
Even assuming knowledge of the process generating the examples, and the true weight 
vector, this optimal choice will depend on the length of the training sequence. Whilst 
there are ways of dealing with the fact that this choice requires knowledge of things 
impossible to know (see Kivinen and Warmuth, 1997, Section 5.1), it does seem a 
difficult way to proceed. After all, as Kivinen and Warmuth (1997, Section 9.2) say 


“In applying a learning algorithm, one is usually not so much concerned 
with the cumulative loss as with the quality of the final hypothesis.” 
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pt the standard signal processing method of comparison (cf. Hill and Williamson, 


1999, 2001): determine steady-state Mean Square Error (MSE) as a function of s, 


and 


then choose s to achieve a fixed steady-state MSE. Then compare algorithms in 


terms of their speed of convergence. 














e Adopt the choice implied by equation 55. This is in fact what we do in the present 
paper. 
GD, mse= 2.4 
2 
1h 
2) 
& 0 
o 
z 
-1F 
-2 
0 500 1000 1500 2000 2500 3000 
Time 


Figure 1: 











Time 


Comparison of exp(a) (a = 0.9) algorithm with standard Gradient Descent. x, 
was drawn independently from a uniform distribution on [-4, aja , M = 1500. 
The step-size for the GD algorithm was scp = 0.0001. Step size for exp(a) 
chosen according to (55) (see text). The target had the non-zero values inferrable 
from the diagram (approximately 1.6, 1.3, 1.1, 0.8,0.25, 0.2, -1.5), with all the 
remaining values zero (a fraction of 0.005 of the dimensions of w, were non-zero). 
Gaussian noise of standard deviation 0.06 was added to the yọ sequence. Both 
algorithms were started from the initial condition (3000; KIWA 3000) . In the graphs, 
only the first 100 dimensions of wz have been plotted for the sake of clarity. (The 
remaining dimensions all had target values of zero.) The indicated Mean Square 
Errors (MSE) were estimated from the final fifth of the run. 
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Figure 2: Same as Figure 1 except with a = 0.1 


In the experiments reported, we used a fixed step size scp for the standard Gradient 
Descent algorithm and then used (55) to determine sẹ for the comparison algorithm. The 
value of sgp was somewhat arbitrary, but chosen to ensure a clear stability margin for the 


GD algorithm. l 
For the exp(a) algorithm, ¢(w) = T: ell, Thus from (55) 





i N 

ERR _ (sap)* TA 
Teel 1 

= sk va) = scape ™ Tia lwl = gape mllealla/N (65) 


Similarly for the EGclipped(c) algorithm, with ¢.(w) = JAM min(c, wy) we choose 


SEGclipped(c) = SaDGc(We) (66) 


With reference to Figure 1, it can be seen that by 3000 iterations both algorithms have 
reached a “steady-state” where each component is being jiggled around the true value by 
the added noise. The key difference between the GD algorithm and the Exp(0.9) is that 
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Figure 3: Same as Figure 1 except with a = 1.1 


the effective step size for the non-zero components is larger; or equivalently, the effective 
step size for the zero components is smaller, which is what leads to the smaller steady-state 
MSE even though the GD algorithm converges slower (taking around 2500 steps to reach a 
steady state, whereas Exp(0.9) reached steady state in around 1000 steps). 

The exp(a) and EGclipped(c) algorithms outperform the standard GD algorithm on 
the problems considered. This is not surprising given the choice of the prior. One can see 
that the convergence speed is qualitatively similar, but that the steady state MSE of the 
expla) and EGclipped(c) algorithms is rather smaller than that for the GD algorithm. It 
can also be seen that increasing c or a, leading to a more extreme prior distribution, leads to 
algorithms whose behaviour is noticeably different from the GD algorithm. Letting c > 00 
in EGclipped(c) leads to the (natural gradient) EG algorithm. We have observed that the 
singularity at the origin in ¢.(w) causes numerical difficulties. 

We also observed it was necessary to replace ®|!(a) given by (49) by 

z (z) = —sgn(x) In(—a2| + e)/a (67) 
where € was chosen as 107? in the experiments reported in order to avoid numerical prob- 


lems. 
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Figure 4: Comparison of the EGclipped(c) algorithm (c = 3), x, drawn iid uniformly on 
[—4,4]”, (M = 300), sap = 0.005. Step size for EGclipped(c) chosen according 


2 2 
to (55) (see text). The target had the non-zero values inferrable from the diagram, 


with all the remaining values zero (a fraction of 0.01 of the dimensions of w, were 
non-zero). Gaussian noise of standard deviation 0.06 was added to the yr se- 
quence. Both algorithms were started from the initial condition (3000 KAYA 7000) : 
In the graphs, only the first 100 dimensions of w; have been plotted for the sake of 
clarity. (The remaining dimensions all had target values of zero.) The indicated 
Mean Square Errors (MSE) were estimated from the final fifth of the run. 


These numerical difficulties would be absent for any ¢(w) satisfying 
cı < d(w) < c2 
for all w and some 0 < cy < C2 < ©. 


11. Conclusions 


We have shown how some new variants on the classical LMS algorithm can be interpreted in 
terms of a prior over the parameter space. The tools used to do so were based on a natural 
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Figure 5: Same as Figure 1 except with c= 10. 


Riemannian structure. The results complement those developed in the area of information 
geometry. The simulation experiments illustrate that the interpretation via a prior is easy 
to reconcile with the actual behaviour of the algorithms. Previous work (Kivinen and 
Warmuth, 1997, Hill and Williamson, 2001) has shown how the EG algorithm can perform 
well in real situations where the target weight vector wx is sparse. The viewpoint developed 
here may well serve as a means for fine tuning the venerable LMS algorithm to better exploit 
prior knowledge one may have in a real problem. An advantage of the proposed framework 
is that, in the case of product prior distributions, it is a simple matter to generate algorithms 
optimized to the prior information available. An example of such an application as well 
as a simplified approximation to the algorithms developed in the present paper is given by 
Martin et al. (2001). 


There are several directions for further research on this topic. The most obvious are 
to consider a wider range of loss functions and to see if there is a closer connection with 
the work on Bregman divergences for analyzing these algorithms. One can also envisage 
combining the framework proposed in the present paper with techniques from information 
geometry in order to draw stronger connections with recent work such as that of Amari 
(1998). Indeed, it is reasonable to ask if one can derive a suitable geometry and associated 
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Figure 6: Same as Figure 1 except with c = 2. 


natural gradient algorithms with respect to both a prior distribution on the weights and 
the likelihood function associated with measurement noise? 

Another direction is to explore the connection (if any) with the prequential approach 
Dawid (1984), which studies sequential prediction from a viewpoint associated with Bayesian 


statistics. 
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Appendix A. Background on Riemannian Geometry 


To obtain a better understanding of the subject it is necessary to enter into the details. An 
excellent quick overview of differential geometry and the basics of Riemannian metrics is 
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given in (Helmke and Moore, 1994, appendix C). Good introductory texts have been written 
by Boothby (1986) and Lee (1997). A deeper knowledge can be obtain from Spivak (1979). 
Murray and Rice (1993) treat similar material and also discuss the geometry of statistical 
inference. 

The modern form of Riemannian geometry provides a language in which the structure of 
curved spaces can be analysed. Riemann was motivated by the ongoing effort to understand 
the significance of Euclid’s fifth postulate during the first half of the nineteenth century. 
This question must have been an important topic of discussion amongst all mathematicians 
of the time, however, Riemann’s early work was in the field of geometric properties of 
analytic functions of complex variables, conformal mappings and connectivity of surfaces. 
In 1854 he submitted a dissertation for his Habilitation on representing complex functions 
using trigonometric series. He was also required to give a public dissertation and submitted 
three possible topics to his examining committee. The first two topics were subjects in 
which he was recognized as having made significant contributions. He clearly expected to 
be asked to speak on this work and, perhaps in wishing to impress Gauss, a member of 
his Habilitation committee, he chose Uber die Hypothesen welche der Geometrie zu Grunde 
liegen (On the hypotheses that lie at the foundations of geometry)‘ as his third topic. Gauss 
called his bluff and Riemann had only a few months in which to prepare material that would 
impress the Habilitation committee. He had never been one to restrict his claims to those 
for which he had a rigorous development and his lecture was a sweeping introduction to 
a vague form of a differentiable manifold, the concept of local quadratic forms, a general 
connection to the concept of curvature, and finally a discussion of the implications of these 
concepts to our understanding of the world we live in. He provided a first vision of modern 
differential geometry, Riemannian manifolds and general relativity wrapped up into a single 
incoherent view of the world. Fortunately, Gauss was present to grasp the importance of 
the material presented. An indication of the visionary nature of Riemann’s lecture is gained 
by noting that it took another sixty years before the concept of a differentiable manifold 
was properly developed in 1913 by Weyl (1923). 

The aim of Riemannian geometry is to provide an intrinsic understanding of geometry 
based on observations made within the manifold considered. This should be contrasted 
to an extrinsic interpretation of geometry such as is possible, for example, for a sphere 
embedded in R*. The sphere is clearly curved, (from the perspective of an observer in the 
‘flat’? Euclidean space R3), and it is simple enough to develop measures of this curvature 
such as Euler’s principal curvatures that involve extrinsic measurements. The problem is 
considerably less clear if one imagines a mythical being living within 2-dimensional space 
on the surface of the sphere. A valid intrinsic understanding of geometry would allow 
such a creature to make measurements to determine whether they were living on ‘flat’ 
Euclidean space compared to, for example, an extremely large sphere (eg. the polygonal 
creatures living in the world created by Abbot (1992)). Gauss certainly believed that this 
was possible and his work on the curvature of surfaces (Gauss, 1965) emphasizes that the 





4. A transcript of Riemann’s lecture was published in (Riemann, 1868). An English translation was made 
by Spivak and appears in volume 2 of his comprehensive treatise on geometry (Spivak, 1979). Our 
knowledge of this work came from the recent delightful book by McCleary (1994). 

5. The principal curvatures of a surface in R? are the maximal and minimal curvature of all curves obtained 
by intersecting the surface with a plane (Euler, 1760). 
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infinitesimal curvature of a surface (as measured by the product of the principal curvatures) 
is related to the difference between the sum of the angles of an infinitesimal triangle drawn 
at the point of interest from 180°. This is an intrinsic measure available to creatures living 
within the surface. 

The key concept of Riemannian geometry is the concept of independence of the measure 
of length (and volume) from the local coordinates used to measure position. Underlying 
this concept is the understanding that there are no ‘special’ coordinates for a manifold 
and consequently that a geometric property is intrinsic to that manifold if and only if it is 
invariant under coordinate transformations. The fundamental axiom that Riemann worked 
from was that the infinitesimal length of a curve should be a physical quantity independent 
of the mathematical representation of the manifold but quite possibly dependent on the 
point at which the measurement is taken. In modern language, the measure of length 
on the manifold must be invariant under changes in local coordinates. In order for this 
to be true, the explicit expression of the infinitesimal measure of length must depend on 
the coordinates used; otherwise they would not transform with changes in the coordinates. 
From this principle it it is possible to derive the expression 





ds = y, gij (x)dzidzj 


ij=1 


where s denotes the curve parameterization and the g;;(x) code the manner in which the 
length of the curve depends on the particular local coordinates {£1, £2,... , £n} used. The 
matrix G(x) = [gi;(x)]ij is called the Riemannian metric for the manifold considered. 
The properties of symmetry, non-degeneracy and positive definiteness of the matrix G(x) 
correspond to physical properties of measuring length, albeit the infinitesimal limit of length. 
Of course, infinitesimal length may be integrated along curves to obtain a classical distance 
metric on the space. A curve tracing out the shortest distance between any two points 
according to this metric is called a geodesic. Such curves would appear as straight lines to 
a creature living within the manifold. 

Using the above intuition is now possible to state the intrinsic difference between a 
curved Riemannian space and flat space. For Euclidean space there exists a ‘special’ coor- 
dinate system for which the infinitesimal length measure is expressed 


5 dzidzi. 
i=1 


That is, the metric is gi;(x) = 6;; (G(x) = I). Clearly, a manifold is intrinsically Euclidean 
(or ‘flat’) if there exists a set of coordinates for which the Riemannian metric is transformed 
to the identity matrix. Of course, this must hold equally at every point in the space. From 
symmetry there n(n — 1)/2 linearly independent functions that define gj;(x), there are n 
degrees of freedom available by altering the coordinate function, and consequently there are 
n(n — 1)/2 components of gij(x) that cannot be arbitrarily fixed by variation in the coor- 
dinate chart. For example, for any two dimensional Riemannian manifold one may choose 
coordinates (locally) such that gi1 = g22 = 1, however, the function gi2(a) = 921(2) will in 
general be non-zero and non-constant. The set of all 2-dimensional Riemannian manifolds 


ds 
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Figure 7: A grid of unit measure in the Euclidean coordinates associated with the EG- 
algorithm on the left maps to a grid of curvilinear coordinates in the weight 
space for the EG-algorithm on the right. Observe that the coordinate grid is 
compressed close to the axes corresponding to the property that a unit step of 
distance in this region only alters the actual weight vector w by a small amount. 


are (locally) parameterised by the function 912(7). Euclidean space R? corresponds to the 
case gi2(z) = 0. The independent functions 9;;(7) are linked to the curvature of the space. 

An important observation is that the measure of length is dependent on the point 
at which the measurement is made in the manifold. An instructive example taken from 
the body of the paper is provided by considering the geometry used to analyze the EG- 


algorithm. The metric matrix is G(w) = diag( gory Wa tary) (cf. Subsection 6.2) defined 
on the weight space wt > 0 for alli =1,... ,n. Transforming the coordinates according to 


u(w) := (In(w’),... , n(w”)) , 


one obtains the constant Riemannian metric G(u) = In. Thus, the Riemannian structure 
introduced for the EG-algorithm is Euclidean of ‘flat’. Nevertheless, the geodesics in the 
original weight space are not straight lines. 

It is instructive to plot two examples of the mapping between the Euclidean coordinates 
u(w) introduced above and original weight space in the local coordinates w (cf. Figures 
7 and 8). In Figure 7 it is obvious that in the weight space the unit grid derived from 
the Euclidean coordinates gets compressed close to the tth coordinate axis by a factor of 
VJ 1/(w')? = du), the improper probability distribution that was introduced to model 
the prior knowledge assumed for the EG-algorithm. Thus, taking a unit step in this region 
results in very small changes in the weight vector coefficient w’. The second figure (Figure 8) 
shows the curvilinear nature of the geodesics in the weight space compared to the linear 
geodesics in the Euclidean coordinates for the EG-algorithm. 

The application of Riemannian geometry to statistical learning theory undertaken in this 
paper has a minor variant from classical Riemannian geometry. In statistical learning theory 
the weight vector w lives in a ‘special’ coordinate frame, namely, the coordinate frame that 
is linked to the linear model class of the learning problem. Thus, a preferential structure is 
given by a Riemannian metric defined with respect to ‘special’ coordinates corresponding 
to the weight vectors w used in the linear model class. In these coordinates the measure of 
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0 0.5 1 1.5 2 


Figure 8: An analogous plot to that shown in Figure 7 except that the Euclidean coordinates 
have been rotated by 45°. This plot shows the curvilinear nature of the geodesic 
coordinates induced by the EG preferential structure in weight space. A creature 
living within the weight space equipped with the EG potential stucture would 
observe the curves on the right drawn as straight lines shown in the grid on the 
left. 


length is certainly not constant. Thus, the geodesics appear curved when expressed in the 
coordinates w and the parameterization strongly contributes to the properties of the learning 
algorithm. The key observation is that the performance of the algorithm is measured in the 
‘special’ weight vector coordinates. This is contrasted to classical Riemannian geometry 
where all local coordinate charts are equally valid. If one were to discard the weight vector 
coordinate representation and use, for example, the Euclidean parameterization u := u(w) 
for the EG-algorithm it would be necessary to reparameterize the model class to obtain the 
same learning problem (cf. Section 2) 


z= (exp(u),2), £ ER“, 


for u. Thus, in a learning problem it is both the preferential structure and the structure of 
the model class that combined define the geometry of the problem.® 

It remains to comment on the issues involved in obtaining measurements of the Rie- 
mannian geometry of a manifold using only intrinsic measurements. Since an intrinsic 
measurement is made within the space a direct measure of distance is not sufficient to de- 
termine the important off diagonal metric structure. A differential measure, i.e. how the 
measure of length changes as a displacement is made, will provide information that can be 
used to infer the metric structure of the manifold. Riemann showed that this approach was 
sufficient to intrinsically compute the curvature of a manifold equipped with a infinitesimal 
quadratic distance measure (or Riemannian metric). As a consequence the curvature of a 
Riemannian manifold can be computed from first derivatives of the metric functions g;;(«). 
The actual expressions in coordinate form are intricate and difficult to work with. It was 
Christoffel (1869) who introduced a formal representation of this structure in terms of a set 





6. An interesting consequence of this observation is that any non-singular parameterization of the learning 
problem x +> (f(u),x), (linear in the measurement x) can be represented as linear model class along 
with a preferential structure derived from the local coordinate chart w := f(u), u = f7'(w). 
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Figure 9: A triangle drawn in coordinated w of weight space. 





| Correct | Incorrect 
ZA 71.5° 108° 

















ZB 372 37° 
ZC 71.5° 108° 
| Total | 180° 253° 





Figure 10: A table showing the angles between vertices of the triangle shown in Figure 9. 


of symbols and transformation rules under change of coordinates. The Christoffel symbols 
can be used to compute intrinsic geometric properties of a Riemannian manifold such as 
curvature, parallel transport, geodesics, etc. In this paper, the Christoffel symbols are used 
simply to obtain an explicit ODE for the geodesic equations for the learning algorithms 
considered. 

A final remark on the measurement of angles on Riemannian manifolds is of interest. 
It was well understood by workers in the early nineteenth century that for 2-dimensional 
surfaces the sum of the internal angles of a triangle is linked to the curvature of the surface. 
Indeed, for a triangle the internal angles sum to a value that differs from 180° by a factor 
that depends only on the curvature and the area of the triangle. Of course the angles must 
be measured relative the Riemannian metric that defines the geometry. Thus, the angle 
between two vectors at a point in the space is 


Zu, Vv = arccos ( uTG (ndu 
(ul G(x)u)2 (ul G(x)u)2 





Consider the triangle drawn in Figure 9 constructed from geodesic curves for the ge- 
ometry introduced for the EG-algorithm. Earlier it was shown that a smooth change of 
coordinates exists that transforms the metric into the Euclidean metric and hence that the 
space is flat. Computing the angles of the three points according to formulae above yields 
the correct values shown in Table A. As expected the correct calculation leads to total 
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of 180° corresponding to the fact that the preferential structure for the EG-algorithm is 
flat. The incorrect values shown are the angles computed without using the Riemannian 
metric and correspond to the angles that one observes visually in Figure 9. Summing the 
incorrect angles yields something absurd. This example demonstrates the importance of 
the Riemannian metric in measuring angles as well as providing a measure of distance. 
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